String Metrics and Word Similarity applied to Information Retrieval

Authors

Hao Chen

Qinpei Zhao

Olli Virmajoki

Abstract

Over the past three decades, Information Retrieval (IR) has been studied extensively. Its purpose is to help users locate the information they are looking for. Information retrieval is currently applied in a variety of domains, from database systems to web search engines. Its main idea is to locate documents that contain the terms users specify in their queries.

The thesis presents several string metrics, such as edit distance, Q-gram, cosine similarity and Dice coefficient. All these string metrics are based on plain lexicographic term matching and can be applied to classical information retrieval models such as the vector space, probabilistic and Boolean models. Experimental results of string distance metrics on real data are provided and analyzed.

Word similarity, or semantic similarity, concerns computing the similarity between concepts or senses of words that are not lexicographically similar. Similarity measures based on WordNet can be classified into two categories: one uses solely semantic links, the other combines corpus statistics with taxonomic distance. Five similarity measures belonging to these two categories are selected for an experimental comparison.

Hierarchical clustering algorithms, including both single-linkage and complete-linkage clustering, are studied by employing word similarity measures as clustering criteria. Stopping criteria including Calinski & Harabasz, Hartigan and WB-index are used to find the proper hierarchical level in the clustering algorithms. Experiments on both synthetic and real datasets are conducted and the results are analyzed.

Acknowledgements

I would like to thank Dr. Pasi Fränti for the advice, encouragement and support he provided in supervising this thesis effort. I would also like to thank Qinpei Zhao for her critical analyses, technical advice and recommendations. Special thanks go to Dr. Olli Virmajoki for his reviews and recommendations.
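As a rough illustration of the string metrics named in the abstract, the following is a minimal Python sketch, not the thesis's own implementation. It assumes unit-cost Levenshtein edit distance, overlapping character q-grams without padding, and q = 2 by default; the function names (`edit_distance`, `qgrams`, `dice_coefficient`, `cosine_similarity`) are hypothetical and chosen only for this example.

```python
from collections import Counter
from math import sqrt


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def qgrams(s: str, q: int = 2) -> Counter:
    """Multiset of overlapping character q-grams (no padding: an assumption)."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def dice_coefficient(a: str, b: str, q: int = 2) -> float:
    """Dice coefficient: 2 * |shared q-grams| / (|q-grams of a| + |q-grams of b|)."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    total = sum(ga.values()) + sum(gb.values())
    if total == 0:
        return 0.0  # both strings shorter than q
    return 2 * sum((ga & gb).values()) / total


def cosine_similarity(a: str, b: str, q: int = 2) -> float:
    """Cosine of the angle between the q-gram count vectors of a and b."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    dot = sum(count * gb[gram] for gram, count in ga.items())
    norm = sqrt(sum(v * v for v in ga.values())) * sqrt(sum(v * v for v in gb.values()))
    return dot / norm if norm else 0.0
```

All three q-gram-based functions compare surface characters only, which matches the abstract's description of these metrics as plain lexicographic term matching; semantic measures based on WordNet would need a different approach.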


Similar resources

An Intelligent System for Exact Word Retrieval in Document Databases

Automatic Information retrieval from document image databases is an important and challenging task. The main challenges are font style, size and spacing between characters. In order to meet the challenges, we propose a new technique for matching exact word string from document databases. For this approach, we address two issues: word identification and similarity measurement between documents. ...


Review of ranked-based and unranked-based metrics for determining the effectiveness of search engines

Purpose: Traditionally, there have been many metrics for evaluating search engines; nevertheless, various researchers have proposed new metrics in recent years. Awareness of these new metrics is essential for conducting research on search engine evaluation. So, the purpose of this study was to provide an analysis of important and new metrics for evaluating search engines. Methodology: This is ...


A Survey of Text Similarity Approaches

Measuring the similarity between words, sentences, paragraphs and documents is an important component in various tasks such as information retrieval, document clustering, word-sense disambiguation, automatic essay scoring, short answer grading, machine translation and text summarization. This survey discusses the existing works on text similarity through partitioning them into three approaches;...


Textual Entailment as a Directional Relation

This paper presents three methods for solving the problem of textual entailment, obtained from an equal number of text-to-text similarity metrics. The first method starts with the directional measure of text-to-text similarity presented in Corley and Mihalcea (2005), and integrates word sense disambiguation and several heuristics. The second method exploits the relations between the cosine dire...


Idea-deriving Information Retrieval System

This paper presents an information retrieval system that integrates concept-based retrieval, which focuses on the similarity of the meanings of words, with character-string-matching-based retrieval. The system provides a word association function, a concept retrieval function, and a document classification function, which are expected to help the user to reach the target document quickly or...


Publication date: 2012
